The pml.training data has 19,622 rows and 159 columns, while the pml.testing data has 20 rows and 159 columns. The head of the pml.training data is presented below:
The variable to predict is “classe”. Its distribution is:
##
## A B C D E
## 5580 3797 3422 3216 3607
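As a quick check on the counts above, the short sketch below turns them into relative frequencies, which makes the class balance easier to judge. The counts are copied from the output; the object names `counts`, `total`, and `props` are illustrative only.

```r
# Class counts copied from the table above
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# The total should match the number of rows reported for pml.training
total <- sum(counts)

# Share of each class, rounded to three decimals
props <- round(counts / total, 3)

print(total)
print(props)
```

On the real data the same counts would normally come from `table(pml.training$classe)`.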
To identify the missing (NA) values in each column, the following graph was made:
A table with the same information as the plot is presented below. It shows the name of each variable, its number of NA values, and their percentage of the total number of rows:
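A summary of this kind can be built with base R alone. The following is a minimal sketch on a toy data frame; `df`, its columns, and `na_summary` are hypothetical stand-ins, not the objects used in this report.

```r
# Toy data frame standing in for pml.training (hypothetical columns)
df <- data.frame(
  x1 = c(1, NA, 3, NA),
  x2 = c(NA, NA, NA, 4),
  x3 = c(1, 2, 3, 4)
)

# Number of NA values per column and their share of all rows
na_count <- colSums(is.na(df))
na_summary <- data.frame(
  variable = names(na_count),
  na_count = as.integer(na_count),
  na_pct   = 100 * as.numeric(na_count) / nrow(df)
)

# Sort so the columns with the most missing values come first
na_summary <- na_summary[order(-na_summary$na_pct), ]
print(na_summary)
```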
Because of the high number of NA values in some columns, and in order to keep only variables that carry useful information, those columns were removed; 67 variables were eliminated in this process. The same changes were applied to the test data. After this process the new data is:
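One way to carry out this kind of filtering is sketched below. The 50% cutoff is an assumption made for illustration (the report does not state the threshold used), and `train_df`, `test_df`, and `na_threshold` are hypothetical names.

```r
# Toy training and test sets (hypothetical columns)
train_df <- data.frame(a = c(1, 2, 3, 4), b = c(NA, NA, NA, 1), c = c(NA, 2, 3, 4))
test_df  <- data.frame(a = c(5, 6),       b = c(NA, NA),        c = c(7, 8))

# Assumed cutoff: drop columns with more than 50% NA in the training data
na_threshold <- 0.5
na_share <- colMeans(is.na(train_df))
keep <- names(na_share)[na_share <= na_threshold]

# Select the same columns in both sets so they stay aligned
train_clean <- train_df[, keep, drop = FALSE]
test_clean  <- test_df[, keep, drop = FALSE]

print(names(train_clean))
```

Deciding which columns to keep from the training data only, and then applying that selection to the test data, keeps both sets structurally identical.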
Next, the type of each column was reviewed. Some columns were stored as character even though they are in fact numeric, because of erroneous entries in these columns. After converting these variables to numeric, the NA analysis was repeated and the following graph was obtained:
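The conversion step can be sketched as follows. This is an illustration on made-up data: the column names and the `"bad_value"` entry are hypothetical, and the list of columns to convert is chosen by hand here.

```r
# Toy data frame: 'v1' was read as character because of one bad entry
df <- data.frame(
  v1 = c("1.5", "2.0", "bad_value", "4.25"),
  v2 = c(10, 20, 30, 40),
  label = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Columns to convert; 'label' is left out because it is truly categorical
to_convert <- c("v1")

for (col in to_convert) {
  # Entries that cannot be parsed become NA; the coercion warning is expected
  df[[col]] <- suppressWarnings(as.numeric(df[[col]]))
}

str(df)
print(colSums(is.na(df)))
```

Unparseable entries turn into NA during the conversion, which is why the NA counts need to be checked again afterwards.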
A table with the same information as the plot is presented below:
As before, these variables were eliminated. The new number of columns in the data is 59; a glimpse of it is shown below:
Once the final data set was obtained, the correlation between the numeric variables and the variable “classe” was examined. The following table shows these values:
The table is the last row of the correlation matrix of the data. According to these values, no strong linear correlation is observed between the variable to be predicted and any individual predictor. This only means that no single predictor tracks the class on its own; it does not rule out that the predictors separate the classes jointly.
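A row like this can be extracted as in the sketch below. It assumes the class is first given an auxiliary integer coding (A = 1, ..., E = 5); the toy data and the names `p1`, `p2`, and `classe_num` are hypothetical.

```r
set.seed(1)

# Toy data (hypothetical): two numeric predictors and a 5-level class
df <- data.frame(
  p1 = rnorm(50),
  p2 = rnorm(50),
  classe = factor(sample(c("A", "B", "C", "D", "E"), 50, replace = TRUE))
)

# Auxiliary numeric coding of the class; an assumption of this sketch
df$classe_num <- as.numeric(df$classe)

# Correlation matrix over the numeric columns only
num_cols <- sapply(df, is.numeric)
cor_mat <- cor(df[, num_cols], use = "pairwise.complete.obs")

# Last row: correlation of each numeric column with the coded class
last_row <- cor_mat[nrow(cor_mat), , drop = FALSE]
print(round(last_row, 3))
```

Because the integer coding imposes an order on a nominal variable, these coefficients are best read as a rough screening tool rather than a measure of predictive value.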
To fit a model, the data described above was split into two parts, one for training and one for testing. The dimensions of both sets are:
library(caret)
# Remove the auxiliary variable (column 60)
pml.training <- pml.training[, -60]
set.seed(62433)
# Create the training and testing sets (75% / 25% split on classe)
inTrain <- createDataPartition(pml.training$classe, p = 3/4)[[1]]
training <- pml.training[inTrain, ]
testing <- pml.training[-inTrain, ]
dim(training)
## [1] 14718 59
dim(testing)
## [1] 4904 59
To find a good model for the data, two options were considered: the LDA and Rpart methods of the caret package. To evaluate their performance, 10-fold cross-validation was carried out, and the mean accuracy of each model across the folds was obtained:
# Create the folds for cross-validation
folds <- createFolds(training$classe, k = 10)
# Cross-validation: LDA
cv <- lapply(folds, function(x) {
  training_fold <- training[-x, ]
  test_fold <- training[x, ]
  clasificador <- train(classe ~ ., data = training_fold, method = "lda")
  y_pred <- predict(clasificador, newdata = test_fold)
  # The first element of $overall is the accuracy
  precision <- confusionMatrix(data = y_pred, reference = test_fold$classe)$overall[1]
  return(precision)
})
precision_cv <- mean(as.numeric(cv))
# Cross-validation: Rpart
cv_rpart <- lapply(folds, function(x) {
  training_fold <- training[-x, ]
  test_fold <- training[x, ]
  clasificador <- train(classe ~ ., data = training_fold, method = "rpart")
  y_pred <- predict(clasificador, newdata = test_fold)
  # The first element of $overall is the accuracy
  precision <- confusionMatrix(data = y_pred, reference = test_fold$classe)$overall[1]
  return(precision)
})
precision_rpart <- mean(as.numeric(cv_rpart))
z <- as.data.frame(matrix(0,2,2))
names(z) <- c("Model","Accuracy")
z$Model <- c("LDA","Rpart")
z$Accuracy <- c(precision_cv,precision_rpart)
z
## Model Accuracy
## 1 LDA 0.8541236
## 2 Rpart 0.4929404
According to the table above, LDA is the better of the two models, with a cross-validated accuracy of about 0.854 against 0.493 for Rpart.
The confusion matrix associated with this model, computed on the testing set, is presented below:
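# mod2 is not defined in the code shown above; it is assumed here to be
# the LDA model fitted on the full training set
mod2 <- train(classe ~ ., data = training, method = "lda")
pred2 <- predict(mod2, testing)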
confusionMatrix(data = pred2, reference = testing$classe)$table
## Reference
## Prediction A B C D E
## A 1279 103 3 0 0
## B 97 704 96 4 0
## C 19 137 731 87 7
## D 0 5 23 663 86
## E 0 0 2 50 808
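The overall accuracy and the per-class sensitivity can be read off this matrix directly. The sketch below re-enters the values from the output above by hand; `cm`, `accuracy`, and `sensitivity` are illustrative names, not objects from the analysis.

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(1279, 103,   3,   0,   0,
      97, 704,  96,   4,   0,
      19, 137, 731,  87,   7,
       0,   5,  23, 663,  86,
       0,   0,   2,  50, 808),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class sensitivity (recall): correct predictions over the column totals
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 3))
```

The diagonal holds the correctly classified cases, and the largest off-diagonal counts sit between neighbouring classes.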
Finally, the predictions for the pml.testing data, which has 20 observations, are:
pred <- predict(mod2, pml.testing)
pred
## [1] B B B A A E D C A A B C B A E E A B B B
## Levels: A B C D E